
Scrapy Pipeline

The Item Pipeline processes and saves data after the Spider yields it. Each item passes through the pipeline stages you define.


Enable a Pipeline

In settings.py, uncomment and configure ITEM_PIPELINES:

# Enable a pipeline (uncomment and set priority)
ITEM_PIPELINES = {
    "my_crawler.pipelines.MyCrawlerPipeline": 300,
}

The number (300) is the priority: lower numbers run first. By convention, values fall in the 0–1000 range.


Write a Pipeline Class

Edit pipelines.py:

class MyCrawlerPipeline:
    def process_item(self, item, spider):
        # Process or save the item here
        return item

process_item() is called for every item the spider yields. You must return an item (the same one, possibly modified), or raise DropItem to discard it.
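For example, a pipeline that only cleans items might strip whitespace from string fields before returning them. This is a minimal sketch (the TrimPipeline name and the item fields are illustrative, not part of the generated project):

```python
class TrimPipeline:
    """Hypothetical pipeline: strips leading/trailing whitespace from string fields."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item

# The methods have no Scrapy-specific dependencies, so a quick standalone check works
# (spider is unused here, so None is fine):
pipeline = TrimPipeline()
pipeline.process_item({"text": "  hello  ", "author": "Ann"}, None)
# → {'text': 'hello', 'author': 'Ann'}
```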


Example: Save to CSV

import csv

class CsvPipeline:
    def open_spider(self, spider):
        self.file = open("output.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=["text", "author"])
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item

Register it in settings.py:

ITEM_PIPELINES = {
    "my_crawler.pipelines.CsvPipeline": 300,
}
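Because open_spider, process_item, and close_spider take no arguments beyond the (here unused) spider, the lifecycle can also be driven by hand to sanity-check the pipeline outside a crawl. This sketch repeats the CsvPipeline above so it runs on its own:

```python
import csv

class CsvPipeline:
    def open_spider(self, spider):
        self.file = open("output.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=["text", "author"])
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item

# Drive the lifecycle by hand, in the order Scrapy would during a crawl:
pipeline = CsvPipeline()
pipeline.open_spider(None)
pipeline.process_item({"text": "hi", "author": "Ann"}, None)
pipeline.close_spider(None)

with open("output.csv", encoding="utf-8") as f:
    print(f.read())  # prints the header row followed by one data row
```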

Multiple Pipelines

You can register multiple pipelines — each item passes through them in priority order (lower number first).

ITEM_PIPELINES = {
    "my_crawler.pipelines.ValidatePipeline": 100,   # runs first
    "my_crawler.pipelines.CsvPipeline": 300,        # runs second
    "my_crawler.pipelines.DatabasePipeline": 500,   # runs last
}

Each pipeline class is independent. A common pattern:

Priority | Pipeline         | Purpose
-------- | ---------------- | ---------------------------
100      | ValidatePipeline | Clean or drop invalid items
300      | CsvPipeline      | Save to CSV file
500      | DatabasePipeline | Save to database

If a pipeline raises DropItem, processing of that item stops and it is not passed to the remaining pipelines.
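The ValidatePipeline slot from the table could be sketched as follows. The required "text" field is an assumption for illustration, and the try/except import lets the sketch run even without Scrapy installed:

```python
try:
    from scrapy.exceptions import DropItem  # Scrapy's real exception
except ImportError:
    class DropItem(Exception):  # stand-in so this sketch runs without Scrapy
        pass

class ValidatePipeline:
    """Hypothetical validator: drops items missing a non-empty "text" field."""

    def process_item(self, item, spider):
        if not item.get("text"):
            raise DropItem(f"Missing text in {item!r}")
        return item
```

Registered at priority 100 as above, this runs before the CSV and database pipelines, so invalid items never reach them.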